Continuous incident triage for large-scale online service systems

Chen, Junjie; He, Xiaoting; Lin, Qingwei; Zhang, Hongyu; Hao, Dan; Gao, Feng; Xu, Zhangwei; Dang, Yingnong; Zhang, Dongmei

Title: Continuous incident triage for large-scale online service systems
Creator: Chen, Junjie; He, Xiaoting; Lin, Qingwei; Zhang, Hongyu; Hao, Dan; Gao, Feng; Xu, Zhangwei; Dang, Yingnong; Zhang, Dongmei
Relation: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). Proceedings: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) (San Diego, CA 11-15 November, 2019) p. 364-375
Publisher Link: http://dx.doi.org/10.1109/ASE.2019.00042
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Resource Type: conference paper
Date: 2019
Description: In recent years, online service systems have become increasingly popular. Incidents of these systems could cause significant economic loss and customer dissatisfaction. Incident triage, which is the process of assigning a new incident to the responsible team, is vitally important for quick recovery of the affected service. Our industry experience shows that in practice, incident triage is not conducted only once in the beginning, but is a continuous process, in which engineers from different teams have to discuss intensively among themselves about an incident, and continuously refine the incident-triage result until the correct assignment is reached. In particular, our empirical study on 8 real online service systems shows that the percentage of incidents that were reassigned ranges from 5.43% to 68.26% and the number of discussion items before achieving the correct assignment is up to 11.32 on average. To improve the existing incident triage process, in this paper, we propose DeepCT, a Deep learning based approach to automated Continuous incident Triage. DeepCT incorporates a novel GRU-based (Gated Recurrent Unit) model with an attention-based mask strategy and a revised loss function, which can incrementally learn knowledge from discussions and update incident-triage results. Using DeepCT, the correct incident assignment can be achieved with fewer discussions. We conducted an extensive evaluation of DeepCT on 14 large-scale online service systems in Microsoft. The results show that DeepCT is able to achieve more accurate and efficient incident triage, e.g., the average accuracy identifying the responsible team precisely is 0.641~0.729 with the number of discussion items increasing from 1 to 5. Also, DeepCT statistically significantly outperforms the state-of-the-art bug triage approach.
Subject: incident triage; online service system; deep learning
Identifier: http://hdl.handle.net/1959.13/1448311
Identifier: uon:43375
Identifier: ISBN:9781728125091
Identifier: ISSN:1938-4300
Language: eng
Reviewed
